Okay. Welcome to the last week of AI. You probably think the last quiz has already happened. It hasn't, because otherwise you would have one week of material not covered by quizzes, and that's bad for exam preparation. So there will be another quiz, probably a week from today at the normal time, unless that's a problem from your perspective. In any case it's optional, of course. And then there will be a long-running quiz tied to the survey, as an incentive to fill out the survey you've probably seen last semester. So two more quizzes, which brings the total to 12 and makes "best 10 out of 12" more reasonable than "best 10 out of 10".
So that's what's going to happen. Are there any questions about the exam, apart from when it will be? I received an email promising that there will be a resolution this week, even though the person in charge is officially still in the hospital; we'll see. I'm not holding my breath, but I hope this gets resolved soon. I apologize, it's beyond my control. Are there any other questions? Yes? For the second exam, the plan is to hold it at the end of September, and I think the target days are around the 25th or 26th. Whether that's possible I don't know, because you can imagine that that period might be crowded.
They're also trying to avoid too many overlaps, so we'll see. Any more questions? Right, so we're talking about natural language. It's one of the big AI subtopics, because language seems to be correlated with intelligence, so an AI should be able to handle it. Last week we looked at why and how language is non-trivial, and we're now looking at technologies for doing something with language. Before, we looked at language models, which are essentially probability distributions, typically over n-grams, and they already allow us to do certain things. Now we're looking at a very important pre-processing technique called part-of-speech tagging. The idea is that there are huge numbers of words, but some things really only depend on the sequence of word categories.
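The idea that only the sequence of categories matters can be sketched with a toy lexicon. This is a minimal sketch: the category names and word list are invented for illustration, and a real tagger has to handle ambiguous and unknown words.

```python
# A tiny hand-written lexicon mapping words to categories (illustrative only).
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "cat": "NOUN", "mat": "NOUN",
    "sat": "VERB", "chased": "VERB",
    "on": "PREP",
    ",": "COMMA", ".": "PUNCT",  # punctuation is tokenized like a word
}

def tag(tokens):
    """Map every token to its category; unknown words get 'UNK'."""
    return [(t, LEXICON.get(t.lower(), "UNK")) for t in tokens]

# The word sequence collapses to a much smaller category sequence:
print([t for _, t in tag("the dog sat on the mat .".split())])
# ['DET', 'NOUN', 'VERB', 'PREP', 'DET', 'NOUN', 'PUNCT']
```

The point is that many different sentences map to the same short category sequence, which is much friendlier to distribution estimates than raw words.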
There are many, many words, hundreds of thousands in English alone, and they give our distributions problems. Sometimes we can do better by just looking at the types of words and the probabilities of their sequences; that alone can already give us a lot of information. The types of words can sometimes simply be looked up in a lexicon. There are surprisingly few verbs, for instance: if you know what a verb is, then they are well known, there are only five or six thousand or so in English. And sometimes you can tell what kind of word something is just by looking at the shape of the word, and so on. So part-of-speech tagging is something relatively simple: you define a set of categories, like preposition, determiner, noun, and so on, and then you assign a category to every word, e.g. determiner, noun, preposition, determiner, noun again, and so forth. One thing that is often overlooked is that in such an example the comma is treated like a word: we tokenize it as a word and give it the category "comma". You can imagine what the category of the full stop is. In and of itself this is very boring, but as a first step for certain natural language processing techniques it is absolutely essential. So the question is: how do we do this? Say we have a corpus of a billion words of text, which is roughly the size of a good newspaper text corpus. How do we do that, and how do we do it efficiently? One application of part-of-speech tagging is pronunciation. If you know the word class, then you can disambiguate and get the pronunciation right in many cases, because which word it is is often disambiguated by its part of speech. Think of the word "record": if you record a sound, the verb is pronounced differently from a record. This problem has received a lot of attention and a lot of engineering, and as you can imagine, if you want to build part-of-speech taggers, what you do is take a corpus, let humans tag it, and then use machine learning algorithms to do the tagging automatically.
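The workflow just described (humans tag a corpus, then a learner is trained on it) can be sketched with the simplest possible learner, a most-frequent-tag baseline. The tiny tagged corpus below is invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged corpus: (word, tag) pairs, as human annotators
# would produce them.
TAGGED = [
    ("the", "DET"), ("record", "NOUN"), ("is", "VERB"), ("old", "ADJ"),
    ("they", "PRON"), ("record", "VERB"), ("sounds", "NOUN"),
    ("the", "DET"), ("record", "NOUN"), ("plays", "VERB"),
]

def train_baseline(tagged):
    """For each word, remember the tag it received most often."""
    counts = defaultdict(Counter)
    for word, t in tagged:
        counts[word][t] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

model = train_baseline(TAGGED)
print(model["record"])  # NOUN: tagged NOUN twice, VERB once in the toy data
```

Real taggers go beyond per-word counts and also model the tag sequence, but even this baseline already gets a surprisingly large fraction of tokens right on real corpora.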
One corpus that has been used a lot for this is the Penn Treebank: the linguists at the University of Pennsylvania acquired a newswire corpus and then kept lots of linguistics students from starving by letting them do manual part-of-speech tagging to finance their studies.
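Coming back to the pronunciation application from above: once a tagger has chosen a part of speech, a lookup keyed on (word, tag) resolves cases like "record". The table below is a hypothetical sketch with informal stress markings, not a real pronunciation dictionary:

```python
# Illustrative pronunciation table keyed on (word, tag). The stress notation
# ("RE-cord" vs "re-CORD") is informal, just to show the disambiguation.
PRON = {
    ("record", "NOUN"): "RE-cord",  # stress on the first syllable
    ("record", "VERB"): "re-CORD",  # stress on the second syllable
}

def pronounce(word, tag):
    """Return the tag-dependent pronunciation, or the word itself as fallback."""
    return PRON.get((word, tag), word)

print(pronounce("record", "VERB"))  # re-CORD
print(pronounce("record", "NOUN"))  # RE-cord
```

So a text-to-speech system that runs a tagger first can pick the right entry, which an untagged word-level lookup cannot do.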
Accessible via: Open access
Duration: 01:26:40 min
Recording date: 2025-07-22
Uploaded: 2025-07-23 18:19:07
Language: en-US